Abstract

Recent spectral clustering methods are a popular and powerful technique for data clustering.
These methods need to solve the eigenproblem whose computational complexity is $O(n^3)$, where
$n$ is the number of data samples. In this paper, a non-eigenproblem based clustering method
is proposed to deal with the clustering problem. Its performance is comparable to that of the spectral
clustering algorithms, but it is more efficient, with computational complexity $O(n^2)$. We show that
with a transitive distance and an observed property, called K-means duality, our algorithm can be
used to handle data sets with complex cluster shapes, multi-scale clusters, and noise. Moreover,
no parameters except the number of clusters need to be set in our algorithm.

1 Introduction

Data clustering is an important technique in many applications such as data mining, image processing,
pattern recognition, and computer vision. Much effort has been devoted to this research [12], [9], [15],
[13], [8], [3], [18], [1]. A basic principle (assumption) that guides the design of a clustering algorithm
is:

Consistency: Data within the same cluster are close to each other, while data belonging to different
clusters are relatively far away.

According to this principle, the hierarchical approach [10] begins with a trivial clustering scheme where
every sample is a cluster, and then iteratively finds the closest (most similar) pairs of clusters and merges
them into larger clusters. This technique depends entirely on the local structure of the data, without
optimizing a global function. An easily observed disadvantage of this approach is that it often fails when
a data set consists of multi-scale clusters [18].

Besides the above consistency assumption, methods like K-means and EM also assume that a
data set has some kind of underlying structure (hyperellipsoid-shaped or Gaussian distribution) and
thus any two clusters can be separated by hyperplanes. In this case, the commonly used Euclidean
distance is suitable for the clustering purpose.

With the introduction of kernels, many recent methods like spectral clustering [13], [18] consider
that clusters in a data set may have more complex shapes than compact sample clouds. In this
general case, kernel-based techniques are used to achieve a reasonable distance measure among the
samples. In [13], the eigenvectors of the distance matrix play a key role in clustering. To overcome
problems such as multi-scale clusters in [13], Zelnik-Manor and Perona proposed self-tuning spectral
clustering, in which the local scale of the data and the structure of the eigenvectors of the distance
matrix are considered [18]. Impressive results have been demonstrated by spectral clustering and it is
regarded as the most promising clustering technique [17]. However, most of the current kernel-related
clustering methods, including spectral clustering that is unified to the kernel K-means framework in
[5], need to solve the eigenproblem, suffering from high computational cost when the data set is large.

In this paper, we tackle the clustering problem where the clusters can be of complex shapes. By using
a transitive distance measure and an observed property, called K-means duality, we show that if the
consistency condition is satisfied, the clusters of arbitrary shapes can be mapped to a new space where
the clusters are more compact and easier to cluster by the K-means algorithm. With comparable
performance to the spectral algorithms, our algorithm does not need to solve the eigenproblem and,
with computational complexity $O(n^2)$, is more efficient than the spectral algorithms, whose complexities
are $O(n^3)$, where $n$ is the number of samples in a data set.

The rest of this paper is structured as follows. In Section 2, we discuss the transitive distance 
measure through a graph model of a data set. In Section 3, the duality of the K-means algorithm is 
proposed and its application to our clustering algorithm is explained. Section 4 describes our algorithm 
and presents a scheme to reduce the computational complexity. Section 5 shows experimental results 
on some synthetic data sets and benchmark data sets, together with comparisons to the K-means 
algorithm and the spectral algorithms in [13] and [18]. The conclusions are given in Section 6. 

2 Ultra-metric and Transitive Distance 

In this section, we first introduce the concept of ultra-metric and then define one, called transitive 
distance, for our clustering algorithm. 

2.1 Ultra-metric

An ultra-metric $D$ for a set of data samples $V = \{x_i \mid i = 1, 2, \dots, n\} \subset \mathbb{R}^l$ is defined as follows:

1) $D : V \times V \to \mathbb{R}$ is a mapping, where $\mathbb{R}$ is the set of real numbers.

2) $D(x_i, x_j) \ge 0$,

3) $D(x_i, x_j) = 0$ if and only if $x_i = x_j$,

4) $D(x_i, x_j) = D(x_j, x_i)$,

5) $D(x_i, x_j) \le \max\{D(x_i, x_k), D(x_k, x_j)\}$ for any $x_i$, $x_j$, and $x_k$ in $V$.

The last condition is called the ultra-metric inequality. The ultra-metric may seem strange at
first glance, but it appears naturally in many applications, such as in semantics [4] and phylogenetic
tree analysis [14]. To have a better understanding of it, we next show how to obtain an ultra-metric
from a traditional metric where the triangle inequality holds.
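The conditions of the definition above can be checked mechanically on a small distance matrix. The following is a minimal sketch, not part of the original paper: the function name `is_ultrametric` and the tolerance `tol` are illustrative assumptions, and the samples are assumed to be distinct so that condition 3) reduces to "zero only on the diagonal".

```python
from itertools import product


def is_ultrametric(D, tol=1e-12):
    """Check conditions 2)-5) for a square matrix D given as a list of lists.

    Assumption: the n samples are pairwise distinct, so D[i][j] == 0 must
    hold exactly when i == j.
    """
    n = len(D)
    for i, j in product(range(n), repeat=2):
        if D[i][j] < 0:                        # condition 2: non-negativity
            return False
        if (D[i][j] == 0) != (i == j):         # condition 3: zero only on the diagonal
            return False
        if abs(D[i][j] - D[j][i]) > tol:       # condition 4: symmetry
            return False
    # condition 5: the ultra-metric inequality for every triple (i, j, k)
    return all(D[i][j] <= max(D[i][k], D[k][j]) + tol
               for i, j, k in product(range(n), repeat=3))


# Euclidean distances of the points 0, 1, 3 on a line: a metric, not an ultra-metric.
euclid = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
# Replacing the entry 3 by 2 satisfies the ultra-metric inequality.
ultra = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
print(is_ultrametric(euclid), is_ultrametric(ultra))
```

The triple loop costs $O(n^3)$, so this is meant as a sanity check on small examples only.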

In Fig. 1, the distance between samples $x_p$ and $x_q$ is larger than that between $x_p$ and $x_s$ from the
usual viewpoint of the Euclidean metric. A more reasonable metric on the data set should give a closer
relationship (thus a smaller distance) between $x_p$ and $x_q$ than that between $x_p$ and $x_s$, since $x_p$ and $x_q$
lie in the same cluster but $x_p$ and $x_s$ do not. A common method to overcome this difficulty is to create
a non-linear mapping

$$\phi : V \subset \mathbb{R}^l \to V' \subset \mathbb{R}^s, \tag{1}$$

such that the images of any two clusters in $\mathbb{R}^s$ can be split linearly. This method is called the kernel
trick and is overwhelmingly used in recent clustering schemes. Usually the mapping that can reach
this goal is hard to find. Besides, another problem arises when the size of the data set increases: these
schemes usually depend on the solution to the eigenproblem, the time complexity of which is generally
$O(n^3)$.

Can we have a method that can overcome the above two problems and still achieve the kernel effect?
In Fig. 1(a), we observe that $x_p$ and $x_q$ are in the same cluster only because the other samples marked
by a circle exist; otherwise it makes no sense to argue that $x_p$ and $x_q$ are closer than $x_p$ and $x_s$. In
other words, the samples marked by a circle contribute the information to support this observation.

Figure 1: (a) A two-moon data set used to demonstrate the transitive distance, where samples of one
cluster are denoted by circles and samples of another cluster are denoted by dots. (b) Maps of transitive
distance matrices with different orders.



Let us also call each sample a messenger. Take $x_u$ as an example. It brings some message from $x_p$
to $x_q$ and vice versa. The way that $x_p$ and $x_q$ are closer than the Euclidean distance between them can
be formulated as

$$D(x_p, x_q) \le \max\{d(x_p, x_u), d(x_u, x_q)\}, \tag{2}$$

where $d(\cdot, \cdot)$ is the Euclidean distance between two samples, and $D(\cdot, \cdot)$ is the distance we are trying to
find that can reflect the true relationship between samples. In (2), $x_u$ builds a bridge between $x_p$ and
$x_q$. When more and more messengers come in, we can define a distance through $k$
of these messengers. Let $\mathcal{P} = x_{u_1} x_{u_2} \cdots x_{u_k}$ be a path with $k$ vertices, where $x_{u_1} = x_p$ and $x_{u_k} = x_q$.
A distance between $x_p$ and $x_q$ with $\mathcal{P}$ is defined as



$$D_{\mathcal{P}}(x_p, x_q) = \max_{1 \le i \le k-1} \{ d(x_{u_i}, x_{u_{i+1}}) \}. \tag{3}$$



We show an example in Fig. 1(a), where a path $\mathcal{P}$ from $x_p$ to $x_q$ is given. The new distance between
$x_p$ and $x_q$ through $\mathcal{P}$ equals $d(x_u, x_v)$, which is smaller than the original distance $d(x_p, x_q)$. For samples
$x_p$ and $x_s$, there are also paths between them, such as the path $\mathcal{Q}$, which also result in new distances
between them smaller than $d(x_p, x_s)$. However, no matter how the path is chosen, the new distance
between $x_p$ and $x_s$ is always larger than or equal to the smallest gap between the two clusters.

Given two samples in a data set, we can have many paths connecting them. Therefore we define the 
new distance, called the transitive distance, between two samples as follows. 

Definition 1. Given the Euclidean distance $d(\cdot, \cdot)$, the derived transitive distance between samples
$x_p, x_q \in V$ with order $k$ is defined as

$$D_k(x_p, x_q) = \min_{\mathcal{P} \in \mathbb{P}_k} \max_{e \in \mathcal{P}} \{ d(e) \}, \tag{4}$$

where $\mathbb{P}_k$ is the set of paths connecting $x_p$ and $x_q$, each such path is composed of at most $k$ vertices,
$e = x_i x_j$, and $d(e) = d(x_i, x_j)$.
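Definition 1 can be evaluated directly for small data sets by a min-max relaxation: a path with at most $k + 1$ vertices from $x_p$ to $x_q$ is a path with at most $k$ vertices from $x_p$ to some $x_u$ followed by the edge $x_u x_q$. The sketch below is an illustration under that reading, not the algorithm proposed later in the paper; the function name is hypothetical, the input is assumed to be a symmetric matrix with a zero diagonal, and orders $k \ge 2$ are assumed (order 2 being the direct edge).

```python
def transitive_distance_order_k(d, k):
    """Transitive distance D_k for all pairs, for paths with at most k vertices.

    d is a symmetric distance matrix (list of lists) with zeros on the diagonal.
    Each round costs O(n^3), so this is only meant for small illustrations.
    """
    n = len(d)
    D = [row[:] for row in d]          # at most 2 vertices: the direct edge
    for _ in range(k - 2):             # each round allows one more vertex
        # best path to u with the current vertex budget, then the edge (u, q);
        # u == q reproduces the previous value because d[q][q] == 0
        D = [[min(max(D[p][u], d[u][q]) for u in range(n)) for q in range(n)]
             for p in range(n)]
    return D


# Points 0, 1, 3 on a line: going through the middle point lowers D(x_0, x_2) from 3 to 2.
d = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
print(transitive_distance_order_k(d, 2)[0][2], transitive_distance_order_k(d, 3)[0][2])
```

Because the minimum is taken over a growing set of paths, the values can only decrease or stay the same as the order increases.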



In Fig. 1(b), we show the maps of transitive distance matrices for the data set in Fig. 1(a) with
orders from 1 to 6, where a larger intensity denotes a smaller transitive distance. In this data set, there
are 50 samples, and the samples in each cluster are consecutively labeled. From these maps, we can see
that when $k$ is larger, the ratios of the inter-cluster transitive distances to the intra-cluster transitive
distances tend to be larger. In other words, if more messengers are involved, the obtained transitive
distances better represent the relationship among the samples.

When the order $k = n$, where $n$ is the number of all the samples, we denote $D_n$ by $D$ for simplicity.
The following proposition shows that $D$ is an ultrametric.

Proposition 1. The transitive distance D is an ultrametric on a given data set. 

The proof of Proposition 1 is simple and omitted here. So given a data set $V$ and its distance
matrix $E$, we can obtain another ultrametric distance matrix $E'$ through Definition 1. In [6], an $O(n^3)$
algorithm is given to derive $E'$ from $E$. In Section 4, we propose an algorithm which is almost $O(n^2)$
to obtain $E'$.

It is worth mentioning that although we use $d(\cdot, \cdot)$ to denote the Euclidean distance for convenience in
the previous discussion, we can replace $d(\cdot, \cdot)$ with any other traditional distance (metric) in Definition 1
and still have Proposition 1. Therefore, in what follows, $d(\cdot, \cdot)$ is used to denote any traditional distance.

2.2 Kernel Trick by the Transitive Distance 

In this section, we show that the derived ultra-metric well reflects the relationship among data samples 
and a kernel mapping with a promising property can be obtained. First we introduce a lemma from 
[11] and [7]. 

Lemma 1. Every finite ultrametric space consisting of $n$ distinct points can be isometrically embedded
into an $(n - 1)$-dimensional Euclidean space.

With Lemma 1, we have the mapping^1

$$\phi : (V \subset \mathbb{R}^l, D) \to (V' \subset \mathbb{R}^s, d'), \tag{5}$$

where $\phi(x_i) = x'_i \in V'$, $s = n - 1$, and $n$ is the number of points in the set $V$. We also have
$d'(\phi(x_i), \phi(x_j)) = D(x_i, x_j)$, where $d'(\cdot, \cdot)$ is the Euclidean distance in $\mathbb{R}^s$, i.e., the Euclidean distance
between two points in $V'$ equals the corresponding ultrametric distance in $V$.

Before giving an important theorem, we define the consistency stated in Section 1 precisely. 

Definition 2. A labeling scheme $\{(x_i, l_i)\}$ of a data set $V = \{x_i \mid i = 1, 2, \dots, n\}$, where $l_i$ is the
cluster label of $x_i$, is called consistent with some distance $d(\cdot, \cdot)$ if the following condition holds: for
any $y \notin C$ and any partition $C = C_1 \cup C_2$, we have $d(C_1, C_2) < d(y, C)$, where $C \subset V$ is some
cluster, $y \in V$, $d(C_1, C_2) \overset{\text{def}}{=} \min_{x_i \in C_1,\, x_j \in C_2} d(x_i, x_j)$ is the distance between the two sets $C_1$ and $C_2$, and
$d(y, C) \overset{\text{def}}{=} \min_{x \in C} d(y, x)$ is the distance between a point $y$ and the set $C$.

The consistency requires that the intra-cluster distance is strictly smaller than the inter-cluster
distance. This might be too strict in some practical applications, but it helps us reveal the following
desirable property for clustering.
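To make Definition 2 concrete, the condition can be tested by brute force on tiny labeled sets. The sketch below is only an illustration: the function name `is_consistent` is hypothetical, clusters with fewer than two samples are skipped as an assumption, and the enumeration of all two-part partitions is exponential in the cluster size.

```python
from itertools import combinations


def is_consistent(d, labels):
    """Brute-force test of Definition 2 for a distance matrix d and a label list."""
    n = len(d)
    for c in set(labels):
        inside = [i for i in range(n) if labels[i] == c]
        outside = [i for i in range(n) if labels[i] != c]
        if len(inside) < 2 or not outside:
            continue                              # nothing to compare (assumption)
        # smallest d(y, C) over all samples y outside the cluster C
        gap = min(d[y][x] for y in outside for x in inside)
        first, rest = inside[0], inside[1:]
        # enumerate partitions C = C1 U C2 with `first` fixed in C1 and C2 non-empty
        for r in range(len(rest)):
            for extra in combinations(rest, r):
                c1 = [first, *extra]
                c2 = [x for x in rest if x not in extra]
                split = min(d[a][b] for a in c1 for b in c2)   # d(C1, C2)
                if split >= gap:                  # violates d(C1, C2) < d(y, C)
                    return False
    return True


# Two tight groups on a line, far apart: the labeling is consistent.
pts = [0, 1, 2, 10, 11]
d = [[abs(a - b) for b in pts] for a in pts]
print(is_consistent(d, [0, 0, 0, 1, 1]))
```

Fixing one sample in $C_1$ avoids visiting each partition twice, since swapping $C_1$ and $C_2$ does not change $d(C_1, C_2)$.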

Theorem 1. If a labeling scheme of a data set $V = \{x_i \mid i = 1, 2, \dots, n\}$ is consistent with a distance
$d(\cdot, \cdot)$, then given the derived transitive distance $D$ and the embedding $\phi : (V, D) \to (V', d')$, the convex
hulls of the images of the clusters in $V'$ do not intersect with each other.

The proof of the theorem can be found in Appendix A. An example of the theorem is illustrated in
Fig. 2. A data set $V$ with 50 points in $\mathbb{R}^2$ is mapped (embedded) into $\mathbb{R}^{49}$, a much higher dimensional
Euclidean space, where the convex hulls of the two clusters do not intersect. Moreover, the Euclidean
distance between any two samples in $V'$ is equal to the transitive distance between these two samples
in $V$. The convex hulls of the two clusters intersect in $\mathbb{R}^2$ but do not in $\mathbb{R}^{49}$, meaning that they are
linearly separable in a higher dimensional Euclidean space. We can see that the embedding $\phi$ is a
desirable kernel mapping.



^1 We use $d(\cdot, \cdot)$ to denote a traditional distance in $V$ and $d'(\cdot, \cdot)$ the Euclidean distance in $V'$.




Figure 2: Mapping a set of 50 data samples in $V \subset \mathbb{R}^2$ to $V' \subset \mathbb{R}^{49}$.

Figure 3: (a) Clustering result obtained by the K-means algorithm on the original data set $V$. (b)
Clustering result obtained by the K-means algorithm on $Z$ derived from the distance matrix of $V$.
Only one sample has different labelings from the two results.

Obviously, the clustering of $V'$ is much easier than the clustering of $V$. It seems that the K-means
algorithm can be used to perform the clustering of $V'$ easily. Unfortunately, we only have the distance
matrix $E' = [d'_{ij}] = [D_{ij}]$ of $V'$, instead of the coordinates of $x'_i \in V'$, which are necessary for the
K-means algorithm. In Section 3, we explain how to circumvent this problem.



3 K-Means Duality 

Let $E = [d_{ij}]$ be the distance matrix obtained from a data set $V = \{x_i \mid i = 1, 2, \dots, n\}$. From $E$, we
can derive a new set $Z = \{z_i \mid i = 1, 2, \dots, n\}$, with $z_i \in \mathbb{R}^n$ being the $i$th row of $E$. Then we have the
following observation, called the duality of the K-means algorithm.

Observation (K-means duality): The clustering result obtained by the K-means algorithm on Z is very 
similar to that obtained on V if the clusters in V are hyperellipsoid-shaped. 

We have this observation based on a large number of experiments on different data sets. Most data
sets were randomly generated with multi-Gaussian distributions. From more than 100 data sets where
each set contains 200 samples, we compared the results obtained by the K-means algorithm on the original
data sets $V$ and their corresponding sets $Z$. As a whole, the sample labeling difference is only 0.7%.
One example is shown in Fig. 3, in which only one sample is labeled differently by the two clustering
methods.
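A toy version of this comparison can be run in a few lines. The sketch below is not the authors' experiment: it uses one small, well-separated two-blob data set, a plain Lloyd-style K-means written for the purpose, and a deterministic farthest-point initialization, all of which are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iters=100):
    """Lloyd iterations with a deterministic farthest-point start (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        # next center: the point farthest from the centers chosen so far
        centers.append(list(max(points,
                                key=lambda p: min(math.dist(p, c) for c in centers))))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:                       # assignments stable: converged
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                         # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


random.seed(0)
# Two well-separated Gaussian blobs in the plane (toy data).
V = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)] + \
    [(random.gauss(20, 1), random.gauss(20, 1)) for _ in range(20)]
E = [[math.dist(p, q) for q in V] for p in V]   # distance matrix of V
Z = E                                           # z_i is the i-th row of E

labels_V = kmeans(V, 2)
labels_Z = kmeans(Z, 2)
same = sum(a == b for a, b in zip(labels_V, labels_Z))
agreement = max(same, len(V) - same) / len(V)   # invariant to swapping the two labels
print(agreement)
```

On harder, overlapping clusters the two labelings need not agree exactly, which is in line with the small labeling difference reported above.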

The matrix perturbation theory [16] can be used to explain this observation. We begin with an ideal
case by supposing that the inter-cluster sample distances are much larger than the intra-cluster sample
distances (obviously, clustering this kind of data set is easy). In the ideal case, let the distance
between any two samples in the same cluster be 0. If the samples are arranged in such a way that those
in the same cluster are indexed by successive integers, then the distance matrix takes the form

$$\hat{E} = \begin{pmatrix} E_1 & & & \\ & E_2 & & \\ & & \ddots & \\ & & & E_k \end{pmatrix} \begin{matrix} \}\ n_1 \text{ rows} \\ \}\ n_2 \text{ rows} \\ \vdots \\ \}\ n_k \text{ rows} \end{matrix}, \tag{6}$$

where $E_i = 0$, $1 \le i \le k$, represents the distance matrix within the $i$th cluster, $n_1 + n_2 + \cdots + n_k = n$,
and $k$ denotes the number of clusters (the blank off-diagonal blocks hold the inter-cluster distances).
Let $\hat{Z} = \{\hat{z}_i \mid i = 1, 2, \dots, n\}$ with $\hat{z}_i$ being the $i$th row of $\hat{E}$.
Then in this ideal case, we have $\hat{z}_1 = \hat{z}_2 = \cdots = \hat{z}_{n_1}$, $\hat{z}_{n_1+1} = \hat{z}_{n_1+2} = \cdots = \hat{z}_{n_1+n_2}$, $\dots$, $\hat{z}_{n-n_k+1} =
\hat{z}_{n-n_k+2} = \cdots = \hat{z}_n$. Therefore, if $\hat{Z}$ is considered as a data set to be clustered, the distance between
any two samples in each cluster is still 0. On the other hand, for two samples in different clusters, say,
$\hat{z}_1$ and $\hat{z}_{n_1+1}$, we have

$$\hat{z}_1 = (\underbrace{0, \dots, 0}_{n_1}, d_{1,n_1+1}, \dots, d_{1,n_1+n_2}, \dots), \tag{7}$$

$$\hat{z}_{n_1+1} = (d_{n_1+1,1}, \dots, d_{n_1+1,n_1}, \underbrace{0, \dots, 0}_{n_2}, d_{n_1+1,n_1+n_2+1}, \dots), \tag{8}$$

and

$$d(\hat{z}_1, \hat{z}_{n_1+1}) \ge \left( \sum_{j=n_1+1}^{n_1+n_2} d_{1,j}^2 + \sum_{j=1}^{n_1} d_{n_1+1,j}^2 \right)^{1/2}. \tag{9}$$

Thus, the distance between any two samples in different clusters is still large. The distance
relationship in the original data set is preserved completely in this new data set $\hat{Z}$. Obviously, the K-means
algorithm on the original data set can give the same result as that on $\hat{Z}$ in this ideal case. In general
cases, a perturbation $P$ is added to $\hat{E}$, i.e., $E = \hat{E} + P$, where all the diagonal elements of $P$ are zero.
The matrix perturbation theory [16] indicates that the K-means clustering result on the data set $Z$
that is derived from $E$ is similar to that on $\hat{Z}$ if $P$ is not dominant over $\hat{E}$. Our experiments and the
above analysis support the observation of the K-means duality.

Now we are able to give a solution to the problem mentioned at the end of Section 2.2. From
Theorem 1, we can map a data set $V$ to $V' \subset \mathbb{R}^{n-1}$ where the clustering is easier if the clusters with
the original distance are consistent in $V$. The problem we need to handle is that in $\mathbb{R}^{n-1}$ we only have
the distance matrix instead of the coordinates of the samples in $V'$. From the analysis of the K-means
duality in this section, we can perform the clustering based on the distance matrix by the K-means
algorithm. Therefore, the main ingredients for a new clustering algorithm are already available.

4 A New Clustering Algorithm 

Given a data set $V = \{x_i \mid i = 1, 2, \dots, n\}$, our clustering algorithm is described as follows.

In step 2), we need to compute the transitive distance with order n between any two samples in V, 
or equivalently, to find the transitive edge, which is defined below. 

Definition 3. For a weighted complete graph $G = (V, E)$ and any two vertices $x_p, x_q \in V$, the transitive
edge for the pair $x_p$ and $x_q$ is an edge $e = x_u x_v$, such that $e$ lies on a path connecting $x_p$ and $x_q$ and
$D_{pq} = D(x_p, x_q) = d(x_u, x_v)$.

An example of a transitive edge is shown in Fig. 1(a). Because the number of paths between two
vertices (samples) is exponential in the number of samples, a brute-force search for the transitive
distance between two samples is infeasible. It is necessary to design a faster algorithm to carry out this
task. The following Theorem 2 is for this purpose.

Without loss of generality, we assume that the weights of the edges in $G$ are distinct. This can be
achieved by slight perturbations of the positions of the data samples. After this modification, the
clustering result of the data will not be changed if the perturbations are small enough.



Algorithm 1 Clustering Based on the Transitive Distance and the K-means Duality

1) Construct a weighted complete graph $G = (V, E)$, where $E = [d_{ij}]_{n \times n}$ is the distance matrix
containing the weights of all the edges and $d_{ij}$ is the distance between samples $x_i$ and $x_j$.

2) Compute the transitive distance matrix $E' = [d'_{ij}] = [D_{ij}]$ based on $G$ and Definition 1, where
$D_{ij}$ is the transitive distance with order $n$ between samples $x_i$ and $x_j$.

3) Perform clustering on the data set $Z' = \{z'_i \mid i = 1, 2, \dots, n\}$, with $z'_i$ being the $i$th row of $E'$, by
the K-means algorithm and then assign the cluster label of $z'_i$ to $x_i$, $i = 1, 2, \dots, n$.
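The three steps of Algorithm 1 can be sketched end to end as follows. This is an illustrative sketch rather than the authors' implementation: step 2 is computed here by Kruskal-style component merging (when the lightest remaining edge joins two components, every cross pair receives that weight), which is a swapped-in way to obtain the order-$n$ transitive distances, and the K-means routine with its farthest-point initialization, like all function names, is an assumption.

```python
import math


def transitive_matrix(d):
    """Order-n transitive distances via Kruskal-style merging (swapped-in technique)."""
    n = len(d)
    comp = [[i] for i in range(n)]      # members of each component
    root = list(range(n))               # root[i]: component currently holding i
    D = [[0.0] * n for _ in range(n)]
    for w, i, j in sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n)):
        a, b = root[i], root[j]
        if a == b:
            continue
        for x in comp[a]:               # the joining edge is the largest edge on the
            for y in comp[b]:           # min-max path between the two components
                D[x][y] = D[y][x] = w
        for y in comp[b]:
            root[y] = a
        comp[a] += comp[b]
        comp[b] = []
    return D


def kmeans(points, k, iters=100):
    """Lloyd iterations with a deterministic farthest-point start (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points,
                                key=lambda p: min(math.dist(p, c) for c in centers))))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster(samples, k):
    d = [[math.dist(p, q) for q in samples] for p in samples]   # step 1: matrix E
    e_prime = transitive_matrix(d)                              # step 2: matrix E'
    return kmeans(e_prime, k)                                   # step 3: K-means on rows


# Two parallel rows of points, 3 units apart (toy data).
samples = [(float(x), 0.0) for x in range(10)] + [(float(x), 3.0) for x in range(10)]
labels = cluster(samples, 2)
print(labels)
```

Sorting all edges makes this variant $O(n^2 \log n)$; it is used here only because it is short.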



Theorem 2. Given a weighted complete graph $G = (V, E)$ with distinct weights, each transitive edge
lies on the minimum spanning tree $\bar{G} = (V, \bar{E})$ of $G$.

The proof of Theorem 2 can be found in Appendix B. This theorem suggests an efficient algorithm
to compute the transitive matrix $E' = [d'_{ij}]_{n \times n}$, which is shown in Algorithm 2. Next we analyze the
computational complexity of this algorithm.

Algorithm 2 Computing the transitive distance matrix $E' = [d'_{ij}]_{n \times n}$

1) Build the minimum spanning tree $\bar{G} = (V, \bar{E})$ from $G = (V, E)$.

2) Initialize a forest $F \leftarrow \bar{G}$.

3) Repeat

4)   For each tree $T \in F$ do

5)     Cut the edge with the largest weight $w_T$ and partition $T$ into $T_1$ and $T_2$.

6)     For each pair $(x_i, x_j)$, $x_i \in T_1$, $x_j \in T_2$ do

7)       $d'_{ij} \leftarrow w_T$

8)     End for

9)   End for

10) Until each tree in $F$ has only one vertex.
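The steps of Algorithm 2 can be sketched as below. This is a hedged illustration, not the authors' code: the minimum spanning tree is built with a dense $O(n^2)$ version of Prim's algorithm instead of the algorithm in [2], the forest is processed with a stack (which performs the same cuts in a different order), and all function names are hypothetical.

```python
import math


def minimum_spanning_tree(d):
    """Prim's algorithm on a dense distance matrix; returns edges as (w, i, j)."""
    n = len(d)
    in_tree = [False] * n
    best = [math.inf] * n               # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((d[u][parent[u]], parent[u], u))
        for v in range(n):
            if not in_tree[v] and d[u][v] < best[v]:
                best[v], parent[v] = d[u][v], u
    return edges


def transitive_distance_matrix(d):
    n = len(d)
    D = [[0.0] * n for _ in range(n)]
    forest = [(list(range(n)), minimum_spanning_tree(d))]   # steps 1-2
    while forest:                                           # steps 3-10
        verts, edges = forest.pop()
        if not edges:
            continue                                        # single-vertex tree
        w, a, b = max(edges)                                # step 5: heaviest edge
        rest = [e for e in edges if e != (w, a, b)]
        adj = {v: [] for v in verts}
        for _, i, j in rest:
            adj[i].append(j)
            adj[j].append(i)
        side, todo = {a}, [a]                               # vertices on a's side of the cut
        while todo:
            x = todo.pop()
            for y in adj[x]:
                if y not in side:
                    side.add(y)
                    todo.append(y)
        t1 = [v for v in verts if v in side]
        t2 = [v for v in verts if v not in side]
        for i in t1:                                        # steps 6-8
            for j in t2:
                D[i][j] = D[j][i] = w
        forest.append((t1, [e for e in rest if e[1] in side]))
        forest.append((t2, [e for e in rest if e[1] not in side]))
    return D


pts = [0, 1, 10, 11]
d = [[abs(p - q) for q in pts] for p in pts]
print(transitive_distance_matrix(d))
```

Each entry of $E'$ is written exactly once (per symmetric pair), matching the accounting used in the complexity analysis that follows.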



Building the minimum spanning tree from a complete graph $G$ needs time very close to $O(n^2)$ by the
algorithm in [2]^2. When Algorithm 2 stops, at most $n$ non-trivial trees^3 have been generated in total. The number
of edges in each non-trivial tree is not larger than $n$. Therefore, the total time taken by searching
for the edge with the largest weight on each tree (step 5) in the algorithm is bounded by $O(n^2)$. Steps
6-8 are for finding the values for the elements of $E'$. Since each element of $E'$ is visited only once,
the total time consumed by steps 6-8 is $O(n^2)$. Thus the computational complexity of Algorithm 2 is
about $O(n^2)$.



^2 The fastest algorithm [2] to obtain a minimum spanning tree needs $O(e\,\alpha(e, n))$ time, where $e$ is the number of edges
and $\alpha(e, n)$ is the inverse of the Ackermann function. The function $\alpha$ increases extremely slowly with $e$ and $n$, and
therefore in practical applications it can be considered as a constant not larger than 4. In our case, $e = O(n^2)$ for a
complete graph, so the complexity for building a minimum spanning tree is about $O(n^2)$.

^3 A non-trivial tree is a tree with at least one edge.




Figure 4: (a) The minimum spanning tree and the clustering result by our algorithm. (b) The minimum
spanning tree and the clustering result by the hierarchical clustering. The dashed lines are the cutting
edges. The number of clusters is 3.

Considering the time $O(n^2)$ for building the distance matrix $E$, and the fact that the complexity
of the K-means algorithm^4 is close to $O(n^2)$, we conclude that the computational complexity of
Algorithm 1 is about $O(n^2)$.

Although the minimum spanning tree is used to help clustering in both the hierarchical clustering and
our algorithm, the motivations and effects are quite different. In our case, the minimum spanning tree
is for generating a kernel effect (to obtain the relationship among the samples in a high dimensional
space according to Theorem 1), with which the K-means algorithm provides a global optimization
function for clustering. In the hierarchical clustering, by contrast, each iteration step focuses only on the
local sample distributions. This difference leads to distinct algorithms in handling the data obtained
from the minimum spanning tree. We carry out the K-means algorithm on the derived $Z'$ according
to the K-means duality, while the hierarchical clustering cuts the $c - 1$ largest edges from the minimum
spanning tree, where $c$ is the number of clusters. In Fig. 4, we show a data set clustered by the two
approaches. The multi-scale data set makes the hierarchical clustering give an unreasonable result.
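For contrast, the hierarchical variant described above, which cuts the $c - 1$ largest edges of the minimum spanning tree, can be sketched as follows. This is an illustration only: the dense Prim routine, the union-find labeling, and the function names are assumptions and are not taken from [10].

```python
import math


def mst_edges(d):
    """Prim's algorithm on a dense distance matrix; returns edges as (w, i, j)."""
    n = len(d)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((d[u][parent[u]], parent[u], u))
        for v in range(n):
            if not in_tree[v] and d[u][v] < best[v]:
                best[v], parent[v] = d[u][v], u
    return edges


def cut_largest_edges(d, c):
    """Label samples by the components left after removing the c - 1 heaviest MST edges."""
    n = len(d)
    kept = sorted(mst_edges(d))[: n - c]     # n - 1 tree edges minus the c - 1 heaviest
    link = list(range(n))                    # union-find parent pointers

    def find(i):
        while link[i] != i:
            link[i] = link[link[i]]          # path halving
            i = link[i]
        return i

    for _, i, j in kept:
        link[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}   # relabel as 0, 1, ...
    return [ids[r] for r in roots]


pts = [0, 1, 10, 11]
d = [[abs(p - q) for q in pts] for p in pts]
print(cut_largest_edges(d, 2))
```

Because the cut depends only on the heaviest individual edges, a single distant sample can claim a cluster of its own, which is one way to read the multi-scale failure shown in Fig. 4(b).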

5 Experiments 

We have applied the proposed algorithm to a number of clustering problems to test its performance. The
results are compared with those by the K-means algorithm, the NJW spectral clustering algorithm [13]
and the self-tuning spectral clustering algorithm [18]. For each data set, the NJW algorithm needs
manual tuning of the scale and the self-tuning algorithm needs the number of nearest neighbors to be set.
In contrast, no parameters need to be set for our algorithm. In these comparisons, we show
the best clustering results that are obtained by adjusting the parameters in the two spectral clustering
algorithms. All the numbers of clusters are assumed to be known.

5.1 Synthetic Data Sets 

Eight synthetic data sets are used in the experiments. Bounded in the region $(0, 1) \times (0, 1)$, these data
sets have complex cluster shapes, multi-scale clusters, and noise. The clustering results are shown
in Fig. 5. Note that the results obtained by the K-means algorithm are not given because it is obvious
that it cannot deal with these data sets.

In Figs. 5(a)-(c), all the three algorithms obtain the same results. Figs. 5(d)-(f) and (g)-(i) show
three data sets on which the self-tuning algorithm gives different results from the other two algorithms.



^4 The time complexity of the K-means algorithm is $O(npq)$, where $p$ and $q$ are the number of iterations and the
dimension of the data samples, respectively. The data set $Z'$ in Algorithm 1 is in $\mathbb{R}^n$ and thus $q = n$. In practical
applications, $p$ can be considered as smaller than a fixed positive number.














Figure 5: Clustering results by our algorithm and the two spectral algorithms. (a)(b)(c) Results by
the three algorithms. (d)(e)(f) Results by the NJW algorithm and ours. (g)(h)(i) Results by the
self-tuning algorithm. (j) Result by the NJW algorithm. (k) Result by the self-tuning algorithm and ours.
(l)(m)(n) Results by our algorithm, the NJW algorithm, and the self-tuning algorithm, respectively.

Figure 6: The error rates of the four algorithms on the ten data sets constructed from the USPS
database. (Axis label: digits selected in the sets.)

The self-tuning algorithm fails to cluster the data sets no matter how we tune its parameter. Figs. 5(j)
and (k) show two clustering results where the data set has multi-scale clusters. The former is
produced by the NJW algorithm and the latter by the self-tuning and our algorithms. Clustering
the data set in Figs. 5(l)-(n) is a challenging task, where two relatively tightly connected clusters are
surrounded by uniformly distributed noise samples (the third cluster). Our algorithm obtains a more
reasonable result (Fig. 5(l)) than the other two algorithms (Figs. 5(m) and (n)).

From these examples, we can see that our algorithm performs similarly to or better than the NJW
and self-tuning spectral clustering algorithms. This statement applies to many other data sets we have
tried, which are not shown here due to the limitation of space.

5.2 Data Sets from the USPS Database 

The USPS database is an image database provided by the US Postal Service. There are 9298 handwritten
digit images of size 16 x 16 from "0" to "9" in the database, from which we construct ten data sets.
Each set has 1000 images selected randomly with two, three, or four clusters.
Each image is treated as a point in a 256-dimensional Euclidean space. Fig. 6 shows the
error rates of the four algorithms on these sets. In these experiments, the parameters for the NJW and
self-tuning algorithms are tuned carefully to obtain the smallest error rates. These results show that
as a whole, our algorithm achieves the smallest error rate, and the K-means and self-tuning algorithms
perform worst.

5.3 Iris and Ionosphere Data Sets 

We also test the algorithms on two commonly used data sets, Iris and Ionosphere, in the UCI machine
learning database. Iris consists of 150 samples in 3 classes, each with 50 samples. Each sample has 4
features. Ionosphere contains 354 samples in 2 classes and each sample has 34 features. In Table 1 we
show the error rates of the four algorithms on these data sets. For the NJW and self-tuning
algorithms, we have to adjust their parameters ($\delta$ and $N$)^5 to obtain the smallest error rates, which
are shown in the table. Our algorithm results in the smallest error rates among the four algorithms.

5.4 Remarks 

From the experiments, we can see that compared with the K-means algorithm, our algorithm and the 
spectral algorithms can handle the clustering of a data set with complex cluster shapes. Compared 



^5 We tried different $\delta$ from 0.01 to 0.1 with step 0.001 and 0.1 to 4 with step 0.1, and different $N$ from 2 to 30 with
step 1.

Table 1: Error rates of the four algorithms on Iris and Ionosphere data sets

              K-means   NJW                 Self-tuning     Ours
Iris          0.11      0.09 (δ = 0.40)     0.15 (N = 5)    0.07
Ionosphere    0.29      0.27 (δ = 0.20)     0.30 (N = 6)    0.15


with the spectral algorithms, our algorithm has comparable or better performance and does not need
to adjust any parameter. In the above experiments, since we have the ground truth for each data
set, we can try different parameters in the NJW and self-tuning algorithms so that they produce the
best results. However, we do not know which parameters should be the best for unsupervised data
clustering in many applications. Another advantage of our algorithm over the spectral algorithms is
that its computational complexity is close to $O(n^2)$, while the spectral algorithms' complexities are
$O(n^3)$.

6 Conclusion 

In this paper, we have built a connection between the transitive distance and the kernel technique
for data clustering. By using the transitive distance, we show that if the consistency condition is
satisfied, the clusters of arbitrary shapes can be mapped to a new space where the clusters are easier
to separate. Based on the observed K-means duality, we have developed an efficient algorithm
with computational complexity $O(n^2)$. Compared with the two popular spectral algorithms whose
computational complexities are $O(n^3)$, our algorithm is faster, needs no parameter tuning,
and performs very well. Our algorithm can be used to handle challenging clustering problems where
the data sets have complex shapes, multi-scale clusters, and noise.

7 Appendix A: Proof of Theorem 1 

It is reasonable to assume that each cluster has at least two samples. Let $x_i, x_j \in C$, $x_k \notin C$, $x_i, x_j,
x_k \in V$, where $C \subset V$ is some cluster. Then their images after the mapping are $x'_i, x'_j, x'_k \in V'$,
where $x'_i, x'_j \in C'$, $x'_k \notin C'$, and $C' = \phi(C)$.

(i) First, we verify that if $d'(x'_i, x'_j) \ge d_0 \in \mathbb{R}^+$, then there exists a partition $C_1 \cup C_2 = C$ such that
$d(C_1, C_2) \ge d_0$. Such a partition can be obtained by the following steps:

1) Initialize $H = C$, $m = 1$, $C_1 = \emptyset$, and $C_2 = \emptyset$.

2) Find a path $\mathcal{P}$ including the transitive edge from $x_i$ to $x_j$ in $H$.

3) Cut the transitive edge on the path $\mathcal{P}$. Let $\mathcal{P}_m$ ($\mathcal{Q}_m$) be the set consisting of the samples
on $\mathcal{P}$ that are on the same side as $x_i$ ($x_j$) after the cutting, except $x_i$ ($x_j$).

4) $C_1 \leftarrow C_1 \cup \mathcal{P}_m$, $C_2 \leftarrow C_2 \cup \mathcal{Q}_m$, $H \leftarrow H \setminus \{\mathcal{P}_m \cup \mathcal{Q}_m\}$, and $m \leftarrow m + 1$.

5) Repeat 2), 3), and 4) until only $x_i$ and $x_j$ are left in $H$.

6) $\mathcal{P}_m \leftarrow \{x_i\}$, $\mathcal{Q}_m \leftarrow \{x_j\}$, $C_1 \leftarrow C_1 \cup \mathcal{P}_m$, and $C_2 \leftarrow C_2 \cup \mathcal{Q}_m$.

In this procedure, from (4) we can see that $d(\mathcal{P}_s, \mathcal{Q}_t) \ge d'(x'_i, x'_j)$, $1 \le s, t \le m$. Since $C_1 =
\mathcal{P}_1 \cup \mathcal{P}_2 \cup \cdots \cup \mathcal{P}_m$ and $C_2 = \mathcal{Q}_1 \cup \mathcal{Q}_2 \cup \cdots \cup \mathcal{Q}_m$, we have $d(C_1, C_2) = \min_{1 \le s, t \le m} \{d(\mathcal{P}_s, \mathcal{Q}_t)\}$.
Thus, $d(C_1, C_2) \ge d'(x'_i, x'_j) \ge d_0$.

(ii) Second, we show that there exist $x_u \in C$ and $x_v \notin C$ such that $d'(x'_i, x'_k) \ge d(x_u, x_v)$. From
Definition 1, we have a path $\mathcal{P}$ connecting $x_i$ and $x_k$ including the transitive edge. Then there
exists an edge $x_u x_v \in \mathcal{P}$ such that $x_u \in C$ and $x_v \notin C$, and from (4), we have $d'(x'_i, x'_k) \ge
d(x_u, x_v)$.

(iii) Third, we show that

$$d'(x'_i, x'_j) < \min\{d'(x'_i, x'_k), d'(x'_j, x'_k)\}. \tag{10}$$

Assume, to the contrary, that $d'(x'_i, x'_j) \ge d'(x'_i, x'_k)$. From (i) and (ii), we have a partition
$C_1 \cup C_2 = C$, and $x_u \in C$, $x_v \notin C$ such that $d(C_1, C_2) \ge d'(x'_i, x'_j)$ and $d'(x'_i, x'_k) \ge d(x_u, x_v)$.
Thus $d(C_1, C_2) \ge d'(x'_i, x'_j) \ge d'(x'_i, x'_k) \ge d(x_u, x_v) \ge d(C, x_v)$, which contradicts the consistency
of $V$. Therefore, (10) holds.

(iv) Let $C = \{x_{c_1}, \dots, x_{c_m}\}$ be a cluster in $V$, with its image $C' = \phi(C) = \{x'_{c_1}, \dots, x'_{c_m}\} \subset V'$.
Let $\bar{C}'$ be the convex hull of $C'$. Now we verify that no samples not in $C'$ are in $\bar{C}'$. Assume,
to the contrary, that there exists a sample $y' \in \bar{C}'$, $y' \notin C'$. Consider a sample $x' \in C'$. Let $P$
be the hyperplane, each point on which has the same distance to $x'$ and $y'$. Then there must
exist another sample $z' \in C'$ such that $y'$ and $z'$ are on the same side of $P$, which leads to
$d'(x', z') > d'(y', z')$, a contradiction to (10).

In (iv), we have verified that for any cluster $C' \subset V'$, no samples from other clusters can be in the
convex hull of $C'$. Thus, the convex hulls of all the clusters in $V'$ do not intersect each other.

8 Appendix B: Proof of Theorem 2 

For any two distinct vertices $x_1$ and $x_2$ in $G$, let $P = x_{k_1} x_{k_2} \cdots x_{k_s}$ be the path connecting them
including the transitive edge $x_{k_i} x_{k_{i+1}}$, where $k_1 = 1$ and $k_s = 2$. Then from Definition 1, we have

$$d(x_{k_m}, x_{k_{m+1}}) < d(x_{k_i}, x_{k_{i+1}}), \quad m = 1, 2, \dots, i - 1, i + 1, \dots, s - 1. \tag{11}$$

Next we verify that the edge $x_{k_i} x_{k_{i+1}}$ is in $\bar{G}$. Let $G_P = \bar{G} \cup P$. Assume, to the contrary, that
$x_{k_i} x_{k_{i+1}} \notin \bar{G}$. Then the edge $x_{k_i} x_{k_{i+1}}$ must be on a loop $O \subset G_P$. Consider the following two cases:

(i) For any edge $x_u x_v \in \bar{G} \cap O$, $d(x_u, x_v) < d(x_{k_i}, x_{k_{i+1}})$.

(ii) There exists an edge $x_{l_j} x_{l_{j+1}} \in \bar{G} \cap O$ such that $d(x_{l_j}, x_{l_{j+1}}) > d(x_{k_i}, x_{k_{i+1}})$.

Suppose that case (i) is true. Then for any edge on the path $(P \cup O) \setminus \{x_{k_i} x_{k_{i+1}}\}$ that also connects
$x_1$ and $x_2$, we have its length smaller than the transitive edge for $x_1$ and $x_2$, which contradicts the
minimality in Definition 1. Thus case (i) cannot be true.

Suppose that case (ii) is true. Since $\bar{G}^* = (\bar{G} \cup \{x_{k_i} x_{k_{i+1}}\}) \setminus \{x_{l_j} x_{l_{j+1}}\}$ is a spanning tree of $G$, and
the sum of the edge weights in $\bar{G}^*$ is smaller than that in $\bar{G}$, we have a contradiction to the fact that
$\bar{G}$ is the minimum spanning tree. Thus case (ii) cannot be true either, which completes the proof.